Nature Computational Science

Springer Science and Business Media LLC

Preprints posted in the last 90 days, ranked by how well they match Nature Computational Science's content profile, based on 50 papers previously published here. The average preprint has a 0.05% match score for this journal, so anything above that is already an above-average fit.

1
Scalable deep-learning-based inference of time-varying transmission dynamics from outbreak phylogenies

XIE, R.; Zhukova, A.; Pena, P. G.; Iglesias, G.; Hu, S.; Wang, J.; Tsang, T. K.; Dhanasekaran, V.; Kraemer, M. U. G.; Pybus, O. G.; Gascuel, O.

2026-05-10 infectious diseases 10.64898/2026.05.07.26352673 medRxiv
Top 0.1%
22.5%

Infectious disease dynamics can be inferred from pathogen genomic data using phylodynamic methods, but the applicability of many such approaches to large data sets is constrained by computational cost. Recent deep-learning approaches to phylodynamics have improved scalability, yet challenges remain when genetic divergence is limited during fast spreading outbreaks. To address this, we use pathogen-specific models to show that deep-learning models trained on outbreak-like phylogenies can accurately estimate the reproductive number (R) when both the birth-death model and the expected phylogenetic resolution are matched to the target pathogen, highlighting the importance of realistic training conditions. Focusing on three major respiratory pathogens of public health importance (SARS-CoV-2, seasonal human influenza virus, and respiratory syncytial virus (RSV)), we introduce PhyloRt, a scalable framework for estimating the time-varying reproductive number (Rt) from large outbreak phylogenies. PhyloRt decomposes large trees into overlapping subtrees and applies a hierarchical deep-learning-based inference strategy to classify subtrees as exhibiting constant or time-varying reproduction numbers, enabling identifiable and computationally efficient estimation of Rt as a piecewise-constant trajectory through time. Applications to SARS-CoV-2 and influenza outbreaks show that PhyloRt recovers transmission dynamics consistent with estimates derived from mathematical epidemiological and Bayesian phylodynamic analyses. Our work enables scalable and rapid estimation of time-varying transmission dynamics from very large-scale outbreak genomic data sets, supporting real-time genomic epidemiology of emerging pathogens.

Significance: Estimating changes in transmission dynamics over time is important for responding to infectious disease outbreaks. Current methods mostly rely on reported case data from epidemiological surveillance, which can be biased or incomplete due to variable testing capabilities, particularly in resource-limited settings. A complementary approach is to use viral genomes as an alternative data source. However, inferences from genomic data can be computationally intensive and have mainly been applied retrospectively. We present PhyloRt, a scalable deep-learning-based phylodynamic framework that enables fast inference of the time-varying reproductive number (Rt) from large outbreak phylogenies. Our approach is widely applicable and provides a practical approach to monitoring epidemic dynamics, complementing traditional surveillance and supporting timely public health decision-making.

2
Personalized Feature Statistics: Individual-Level Variant Inference under Genetic Ancestry Continuum

Wang, J. F.; Yu, R.; Edelson, J.; Park, J.; Le Guen, Y.; Liu, X.; Belloy, M.; Ionita-Laza, I.; Greicius, M.; Tang, H.; He, Z.

2026-04-29 neurology 10.64898/2026.04.28.26351879 medRxiv
Top 0.1%
18.2%

Genome-wide association studies (GWAS) have successfully identified numerous genetic variants associated with complex diseases. However, the extent to which the effects of these variants vary across populations of diverse ancestries remains poorly understood. Furthermore, in these contexts genetic ancestry is treated as a categorical variable, thereby oversimplifying its continuous nature and the more nuanced ways in which it can influence genetic effects on disease. Here, we propose personalized feature statistics (PFstatistics), a statistical framework that quantifies the importance of genetic variants to a phenotype based on each individual's ancestry background, and profiles heterogeneous genetic effects across the genetic ancestry continuum. We demonstrate the utility of this framework through both simulations and real data analysis using sequencing data from ancestrally diverse cohorts in the Alzheimer's Disease Sequencing Project (ADSP). We show that Alzheimer's Disease (AD) risk variants span a spectrum from ancestry-homogeneous to ancestry-dependent effects, and that PFstatistics characterizes this spectrum at individual resolution across the ancestry continuum. PFstatistics also provides individual-level variant selection with FDR controlled at a target level, yielding distinct selection sets that vary across individuals according to their ancestry background. While demonstrated in the context of genetic ancestry, the proposed method is broadly applicable to other heterogeneity features such as environmental factors, offering a robust tool for understanding complex genetic contributions across diverse populations.

3
Reward-Guided Generation Improves the Scientific Utility of Synthetic Biomedical Data

Jackson, N. J.; Espinosa-Dice, N.; Yan, C.; Malin, B. A.

2026-03-16 health informatics 10.64898/2026.03.11.26348077 medRxiv
Top 0.1%
10.1%

Synthetic data generation is a promising approach for biomedical data sharing and dataset augmentation, yet existing methods lack mechanisms to preserve statistical properties necessary for scientific analysis. To address this, we introduce RLSYN+REG, a reinforcement learning-driven generative model that encourages regression models trained on synthetic data to reproduce the coefficients and predictions of their real-data counterparts. We evaluate RLSYN+REG on MIMIC-III and the American Community Survey (ACS) across regression model reproduction, fidelity to real data, and privacy. Synthetic data from RLSYN+REG substantially improves upon that of RLSYN, raising correlations between real and synthetic regression coefficients from 0.054 to 0.600 on MIMIC-III and from 0.160 to 0.376 on ACS. Predictive performance also improves, reducing the gap to real-data baselines by 81.4% and 97.6% on MIMIC-III and ACS, respectively. These improvements come with negligible cost to fidelity or privacy and are robust to reductions in training data.

4
From Prefix to Path: Learning Temporally Consistent Biomolecular Dynamics from Limited Initial Data

Choudhuri, S.; Adhikari, S.; Mondal, J.

2026-03-05 biophysics 10.64898/2026.03.02.709204 medRxiv
Top 0.1%
9.8%

Molecular dynamics (MD) simulations provide detailed insights into biomolecular motion but are often limited by the prohibitive cost of sampling long-timescale behavior. Here, we present a Transformer-based framework that reconstructs temporally continuous dynamical trajectories from only a small fraction of the initial data, directly targeting time-ordered evolution rather than independent ensemble snapshots. Using three systems spanning distinct dynamical regimes (intrinsically disordered α-Synuclein, Cytochrome P450 ligand-binding motion, and a synthetic three-well potential), we show that the model learns both local fluctuations and long-range temporal structure. At inference time, the model generates full trajectories autoregressively from an initial prefix as prompt, capturing metastable transitions, basin-to-basin movements, and system-specific dynamical signatures. Free-energy surfaces computed from generated trajectories closely match ground-truth landscapes and, in several cases, we observe enhanced sampling in generated trajectories relative to the trained trajectories, while preserving kinetically meaningful transition patterns. These results demonstrate that Transformer architectures can serve as efficient, system-agnostic tools for time-continuous molecular trajectory prediction, offering a data-driven complement to long MD simulations and enabling accelerated exploration of conformational space.

5
TyCHE enables time-resolved lineage tracing of heterogeneously-evolving populations

Fielding, J. J.; Wu, S.; Melton, H. J.; Wang, C. Z.; Fisk, N.; du Plessis, L.; Hoehn, K. B.

2026-05-19 immunology 10.1101/2025.10.21.683591 medRxiv
Top 0.1%
9.7%

Phylogenetic methods for cell lineage tracing have driven significant insights into organismal development, immune responses, and tumor evolution. While most methods estimate mutation trees, time-resolved lineage trees are more interpretable and could relate events like cellular migration and differentiation to perturbations like vaccines and drug treatments. However, somatic mutation rates vary dramatically by cell type, significantly biasing existing methods. We introduce TyCHE (Type-linked Clocks for Heterogeneous Evolution), a Bayesian phylogenetics package that infers time-resolved phylogenies of populations with distinct evolutionary rates. We demonstrate that TyCHE improves tree accuracy using a new simulation package SimBLE (Simulator of B cell Lineage Evolution). We use TyCHE to infer patterns of memory B cell differentiation during HIV infection, dynamics of recall germinal centers following influenza vaccination, evolution of a glioma tumor lineage, and progression of a bacterial lung infection. TyCHE and SimBLE are available as open-source software packages compatible with the BEAST2 and Immcantation ecosystems.

6
Bayesian Nonparametrics for Normative Modelling in Multiple Sclerosis via Modularised Inference

Taschler, B.; Nichols, T. E.; Ganjgahi, H.

2026-05-15 radiology and imaging 10.64898/2026.05.10.26352835 medRxiv
Top 0.1%
8.7%

Normative models produce per-subject deviation scores that feed directly into downstream analyses, but typical pipelines (i) treat confounders with ad-hoc or purely linear adjustments, and (ii) pass point estimates of deviation scores directly to the downstream model, ignoring uncertainty. We propose an integrated, two-module Bayesian framework that aims to address both limitations. A normative module based on Bayesian Additive Regression Trees (BART) flexibly captures non-linear effects and higher-order interactions while marginalising over image-quality variables via counterfactual averaging. Crucially, we define individual deviation as $d_i = E[Y \mid X_i, Z_i] - \mu(Z_i)$, with $\mu(Z)$ the feature-conditional population mean, not as a residual. A SoftBART survival model then ingests the full posterior distribution of deviation scores via a cut-posterior construction, propagating upstream uncertainty while blocking feedback from the outcome model. Across challenging simulations and a large clinical data set of multiple sclerosis patients (N>8k), the integrated approach yields better calibration, prediction accuracy and time-varying hazard separation between groups than a two-step plug-in Cox regression model. Modularised inference with BART-based normative deviations improves both flexibility and uncertainty quantification, and extends naturally to other outcomes beyond survival.

7
Shannon Entropy Trajectories Reveal Between-Arm Distributional Structure Invisible to Standard Endpoint Analysis in Pooled ALS Clinical Trials

Rodriguez, A. M.; The Pooled Resource Open-Access ALS Clinical Trials Consortium,

2026-04-22 neurology 10.64898/2026.04.20.26351319 medRxiv
Top 0.1%
8.6%

Standard analysis of amyotrophic lateral sclerosis (ALS) clinical trials evaluates therapeutic efficacy by comparing linear slopes of total ALS Functional Rating Scale (ALSFRS) scores between treatment arms. This approach compresses multidomain ordinal data into a single scalar trajectory, discarding distributional structure. When subgroup-level trends differ in timing or direction, such aggregation can attenuate or eliminate them, a phenomenon known as Simpson's paradox. Here we apply Shannon entropy, computed from item-level score distributions within each ALSFRS functional domain following the framework established in [8], to the PRO-ACT database, stratified by treatment arm (Active: n = 4,581; Placebo: n = 2,931; 19 monthly time points). The entropy trajectories of drug-treated and placebo populations diverge visibly and systematically across all four functional domains (Bulbar, Fine Motor, Gross Motor, Respiratory). In the Fine Motor domain, the placebo population reaches peak entropy at month 8 and reverses, while the active population does not peak until month 13, a five-month delay in the population's transit toward functional loss. This divergence is model-independent: it is present in the raw Shannon entropy trajectories before any dynamical model is applied. A permutation test shuffling patient-level arm labels (n = 1,000 permutations) confirms that the total integrated absolute divergence across all four domains exceeds the null distribution at p < 0.001 (observed: 4.48; null: 2.03 ± 0.33; 7.5 standard deviations above the null mean), with Fine Motor (p = 0.001) and Respiratory (p < 0.001) individually significant. The quantity that differs between arms, the shape and timing of the population's distributional evolution, does not exist as a measurable quantity in the total-score linear-slope framework used to evaluate these trials. Whether this signal reflects genuine treatment effects, compositional artifacts from pooling heterogeneous trials, or both cannot be determined from the anonymized public database alone. What can be determined is that the standard ALS clinical trial endpoint makes an implicit assumption, that the distributional information it discards is uninformative, and the present results demonstrate empirically that this assumption is false.
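To make the two statistical ingredients named in this abstract concrete, here is a minimal sketch of a per-time-point Shannon entropy trajectory and a label-shuffling permutation test on the integrated absolute divergence between two arms. It is not the authors' code: the base-2 logarithm, the data layout (one list of scores per patient, one score per time point), the function names, and the add-one p-value correction are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def shannon_entropy(scores):
    """Shannon entropy in bits of the empirical distribution of ordinal scores."""
    counts = Counter(scores)
    n = len(scores)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_trajectory(patients):
    """Entropy at each time point; `patients` is a list of equal-length score lists."""
    n_time = len(patients[0])
    return [shannon_entropy([p[t] for p in patients]) for t in range(n_time)]


def integrated_divergence(traj_a, traj_b):
    """Sum over time points of the absolute entropy difference between two arms."""
    return sum(abs(a - b) for a, b in zip(traj_a, traj_b))


def permutation_p_value(arm_a, arm_b, n_perm=1000, seed=0):
    """Shuffle patient-level arm labels and compare the observed divergence to the null."""
    rng = random.Random(seed)
    observed = integrated_divergence(entropy_trajectory(arm_a), entropy_trajectory(arm_b))
    pooled = arm_a + arm_b  # new list, so the inputs are not mutated
    n_a = len(arm_a)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = integrated_divergence(
            entropy_trajectory(pooled[:n_a]), entropy_trajectory(pooled[n_a:])
        )
        if stat >= observed:
            exceed += 1
    # Add-one correction (an assumption here) keeps the estimate away from exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)
```

Shuffling whole patients, rather than individual scores, keeps each patient's trajectory intact, which matches the abstract's description of permuting patient-level arm labels.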

8
Sequence Design and Phylogenetic Inference with Generative Flow Networks

Huang, Q.; Mourra-Diaz, C. M.; Wen, X.; Payette, D.

2026-04-09 synthetic biology 10.64898/2026.04.08.717239 medRxiv
Top 0.1%
8.3%

Phylogenetic inference remains computationally challenging due to the exponentially growing tree topology search space, and current methods rely heavily on multiple sequence alignments (MSAs) which are expensive and error-prone. We propose AncestorGFN, a proof-of-concept approach leveraging Generative Flow Networks (GFlowNets) for simultaneous sequence generation and phylogenetic exploration without requiring explicit MSAs. Our method learns to generate sequences matching a target distribution while the flow trajectories implicitly encode structural relationships among sequences. We demonstrate that greedy traceback on maximum-flow trajectories recovers shared intermediate states suggestive of common ancestry, and evaluate on the let-7 microRNA family where the learned flow structure qualitatively captures phylogenetic branching patterns. Furthermore, beam search at inference time discovers novel sequences clustering near known targets, suggesting applications in de novo sequence design. This work establishes an initial foundation for alignment-free phylogenetic exploration using generative models.

9
Individualized Functional Deviation Mapping: Linking Heterogeneous Structural Atrophy to Convergent Network Disruption in Preclinical Alzheimer's Disease

Tellaetxe-Elorriaga, I.; Jimenez-Marin, A.; Diez, I.; Erramuzpe, A.; Cortes, J. M.

2026-05-13 radiology and imaging 10.64898/2026.05.11.26352893 medRxiv
Top 0.1%
7.0%

The preclinical phase of Alzheimer's disease (AD) is characterized by profound biological and structural heterogeneity, challenging our ability to map early pathology onto large-scale brain networks. To address this fundamental challenge, we introduce Functional Deviation Maps (πz), an individualized neuroimaging framework for mapping participant-specific functional architecture to their unique structural atrophy landscape. By fitting a normative model to the voxel-based morphometry of amyloid-negative individuals, we extract personalized "atrophy seeds" (W-scores ≤ -1.96) for amyloid-positive patients, subsequently obtaining their resting-state seed-based connectivity (SBC). By standardizing these participant-level SBC maps against a healthy reference distribution, we show that, despite the highly variable spatial origins of structural atrophy, individual functional deviations converge into a common "atrophy network". Spatial enrichment analyses show that the functional disruption is not random, but is preferentially dominated by the Default Mode Network. Furthermore, by projecting these populational functional deviations onto high-order cognitive topographies, we find a considerable alignment with the brain's fundamental unimodal-transmodal and external-internal attentional gradients. Overall, the πz framework transcends conventional group-level averages, offering a highly personalized, biologically meaningful signature of system-level network vulnerability in the earliest stages of AD.

10
DAMPA - accelerated and simplified design of probe panels for targeted metagenomics using pangenome graphs

Payne, M.; Tam, K. K.-G.; Rockett, R. J.; Basile, K.; Bowden, R.; Sintchenko, V.; Kok, J.; Golubchik, T.

2026-05-22 infectious diseases 10.64898/2026.05.15.26352859 medRxiv
Top 0.1%
7.0%

Targeted metagenomics, where samples are enriched for multiple organisms of interest using oligonucleotide probes, is a highly efficient sequencing methodology that is becoming standard practice for genomics of viruses and complex polymicrobial samples. Efficient enrichment critically requires probes that capture both conserved and highly diverse genomic regions without loss of sensitivity, and with uniform representation in the sequencing pool. Design of optimal probesets poses a challenge: existing computational methods use k-mer hashing to reduce over-abundant sequences, but scalability and efficiency drop with increasing numbers of genomes, while diverse sequences remain under-represented. Here we show that incorporating evolutionary distance to compress probes via a graph-based representation of multiple genomes across species, together with k-mer hashing, reduces overrepresentation of conserved sequences, and yields more uniform coverage even of highly diverse loci. We make the method available in Dampa, an open-source tool that generates probesets in seconds on a standard laptop.

11
Deciphering antigen-driven T cell responses through vectorized TCRdist sequence neighborhood quantification

Valkiers, S.; Mayer-Blackwell, K.; Yeh, A. C.; Van Deuren, V. M. L.; Fiore-Gartland, A.; Hill, G.; Laukens, K.; Meysman, P.; Bradley, P.

2026-04-14 immunology 10.64898/2026.04.10.717405 medRxiv
Top 0.1%
6.9%

T cells provide precise mechanisms to defend the body against infection and malignancies, mediated through the expression of their hypervariable T cell receptors (TCRs). Interpreting similarity between TCRs, however, remains a significant challenge. While performant clustering methods exist, these often fail to distinguish between antigen-driven convergent selection and patterns arising stochastically from biases in the V(D)J recombination mechanism. Moreover, defining enrichment in sequence similarity among large repertoires is computationally taxing. To address these limitations, we present an efficient computational framework for rapid approximation of TCRdist distances using fixed-length vector embeddings and highly optimized nearest neighbor search, allowing sequence similarity enrichment testing at a multi-repertoire-wide scale. This framework leverages a novel shuffling-based background model that preserves important repertoire characteristics such as V gene frequency, CDR3 sequence length and generation probability more accurately than synthetic models. Together, these tools enable the efficient and robust identification of significantly neighbor enriched (SNE) TCR sequences at scale. We validate this approach by showing a significant enrichment of SNE clones in memory T cell fractions and further demonstrate its utility in identifying convergent T cell signatures of response to vaccination and viral infections, providing a scalable approach for antigen-agnostic T cell response profiling.

12
PrivateBoost: Privacy-Preserving Federated Gradient Boosting for Cross-Device Medical Data

Specht, B.; Garbaya, S.; Ermis, O.; Schneider, R.; Chavarriaga, R.; Khadraoui, D.; Tayeb, Z.

2026-03-10 health informatics 10.64898/2026.02.10.26345891 medRxiv
Top 0.1%
6.5%

Cross-device medical federated learning, where individual patients rather than institutions participate directly, poses a unique challenge: each client holds only a few samples, often just one (e.g., a single diagnostic record), leaving insufficient local data for gradient computation. Existing approaches, such as Secure Aggregation, require client-to-client coordination impractical for intermittently available mobile devices, while homomorphic encryption-based alternatives introduce sophisticated key management and coordination requirements ill-suited to dynamic cross-device deployments. We present privateboost, a federated XGBoost system that addresses this setting through m-of-n Shamir secret sharing with commitment-based anonymous aggregation. Clients distribute shares to a fixed set of shareholders, requiring no client-to-client communication, and the aggregator reconstructs only aggregate gradient sums via Lagrange interpolation, never observing individual values or client identities. We evaluate on UCI medical datasets, demonstrating 98% split gain retention relative to centralized XGBoost and accuracy resilient to up to 80% client dropout.
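The aggregation idea described in this abstract, where shareholders sum the shares they receive and the aggregator interpolates only the total, can be illustrated with a short sketch of m-of-n Shamir secret sharing over a prime field. This is not the privateboost implementation: the field modulus, the function names, and the assumption that gradients are already quantized to non-negative integers are illustrative choices, and the commitment-based anonymity layer is omitted entirely.

```python
import secrets

PRIME = 2**61 - 1  # a Mersenne prime; the field size is an assumption of this sketch


def make_shares(secret, m, n):
    """Split `secret` into n shares, any m of which reconstruct it."""
    # Random polynomial of degree m-1 whose constant term is the secret.
    coeffs = [secret % PRIME] + [secrets.randbelow(PRIME) for _ in range(m - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(points):
    """Lagrange interpolation at x = 0 from at least m (x, y) points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total


def aggregate(client_values, m, n):
    """Each shareholder adds up its shares; only the sum is ever reconstructed."""
    shareholder_sums = [0] * n
    for value in client_values:
        for idx, (_, y) in enumerate(make_shares(value, m, n)):
            shareholder_sums[idx] = (shareholder_sums[idx] + y) % PRIME
    # Any m shareholders suffice; the first m are used here for simplicity.
    points = [(x, shareholder_sums[x - 1]) for x in range(1, m + 1)]
    return reconstruct(points)
```

Because polynomial addition is linear, the pointwise sum of the clients' shares is itself a valid sharing of the summed secret, which is why the aggregator can recover the total without seeing any individual value. The modular inverse via `pow(den, -1, PRIME)` requires Python 3.8 or later.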

13
A biologically annotated neural network for proteomic discovery in Parkinson's disease

Vijayaraghavan, A.; Crawford, L.; Krishnakant, S.; Amini, A. P.; Conard, A. M.; Olsen, A. L.; Chahine, L. M.; Severson, K. A.

2026-04-30 neurology 10.64898/2026.04.29.26351681 medRxiv
Top 0.1%
6.5%

Machine learning models that can utilize high-dimensional data to make predictions and derive biological insights can improve understanding of diseases. Here, we develop a biologically annotated neural network model for proteomics data (P-BANN) which has several practical advantages: (1) it incorporates known relationships between proteins and signaling pathways into its architecture design; (2) it uses Bayesian principles to enable variable selection on the most important proteins for a disease of interest; and (3) it combines structured and black-box variational inference to analyze different classes of phenotypes at scale. To demonstrate the value of the approach, we apply P-BANN to one of the most common neurodegenerative disorders: Parkinson's disease (PD). We consider two biomarker-defined phenotypes within the PD population: presence of neuronal-predominant aggregated α-synuclein in cerebrospinal fluid, and changes in dopamine transporter binding in the striatum on imaging. By considering biomarkers of both neuropathological hallmarks of PD, we can examine the extent to which their underlying biology is connected. Using the P-BANN framework, we discover sparse, statistically-calibrated sets of proteins which map to pathways, enabling more straightforward interpretation and generation of testable hypotheses.

14
Ollivier Ricci Curvature as a Geometric Biomarker for Biomedical Networks: From Ontology to Comorbidity Aging Trajectories

Agourakis, D. C.; Gerenutti, M.

2026-03-16 health informatics 10.64898/2026.03.14.26348393 medRxiv
Top 0.1%
6.4%

Network geometry offers a principled lens for understanding the structure of biomedical knowledge. We apply exact Ollivier-Ricci curvature (ORC) -- a discrete analogue of Riemannian curvature computed via optimal transport -- to medical ontologies, disease comorbidity networks, biological interaction networks, and brain functional connectivity graphs. Three main results emerge. First, within a single database (the Human Phenotype Ontology), the formal IS-A taxonomy is hyperbolic ([Formula], tree-like), while the disease co-occurrence network is spherical ([Formula], clique-rich) -- a six-order-of-magnitude gap in the density parameter that the curvature phase transition framework predicts without free parameters. Second, age-stratified disease comorbidity networks from 8.9 million Austrian hospital patients reveal a geometric aging trajectory: mean ORC increases monotonically from [Formula] (age 20-30) to [Formula] (age 80+), driven by rising clustering and density that encode the accumulation of multimorbidity. Third, sedenion ($\mathbb{R}^{16}$) Mandelbrot orbit features -- exploiting the zero-divisor structure of the Cayley-Dickson tower -- discriminate ASD-like from ADHD-like brain network topology (AUROC = 0.990, sedenion-only), providing complementary geometric information to ORC. Canonical biological networks (C. elegans neural, E. coli gene regulatory, protein-protein interaction) are uniformly spherical, suggesting that evolved biological networks universally favour redundant, triangle-rich connectivity. All core mathematical claims are machine-verified in Lean 4 (0 sorry in 7 core modules). These results establish ORC as a quantitative geometric biomarker for biomedical network analysis and demonstrate that the same phase transition framework governing semantic networks extends to clinical and biological domains.

15
Vision-language framework for multi-sequence brain magnetic resonance imaging

Lteif, D.; Jia, S.; Bit, S.; Kaliaev, A.; Mian, A. Z.; Small, J. E.; Mangaleswaran, B.; Plummer, B. A.; Bargal, S. A.; Au, R.; Kolachalama, V. B.

2026-04-04 radiology and imaging 10.64898/2026.03.30.26349106 medRxiv
Top 0.1%
6.4%

Structural magnetic resonance imaging (MRI) is a cornerstone for diagnosing neurological disorders, yet automated interpretation of multi-sequence brain MRI remains limited by challenges in cross-sequence reasoning and protocol variability. Here we present ReMIND, a vision-language modeling framework tailored for comprehensive multi-sequence and multi-volumetric brain MRI analysis. Trained on over 73,000 deidentified patient visits encompassing more than 850,000 MRI sequences paired with radiology reports from diverse clinical and research cohorts, ReMIND combined large-scale instruction tuning on more than one million clinically grounded question-answer (QA) pairs with targeted supervised fine-tuning for radiology report generation. At inference, ReMIND employed modality-aware reranking and correction, a report-level decoding strategy that suppressed unsupported modality claims while preserving linguistic fluency and clinical coherence. Cross-cohort generalization was maintained on independent external datasets from different institutions. These findings represent an advance toward consistent and equitable brain MRI interpretation, meriting prospective evaluation to support diagnosis and management of neurological conditions.

16
TopBrain Segmentation Challenge for Whole Brain Vessel Anatomy

Yang, K.; Shi, P.; Huang, H.; Musio, F.; Baazaoui, H.; Aydin, O. U.; Hilbert, A.; Hamadache, R. E.; Yalcin, C.; Zhang, M.; Falcetta, D.; de la Rosa, E.; Shit, S.; Prabhakar, C.; Wittmann, B.; Rokuss, M. R.; Kirchhoff, Y.; Al-Maskari, R.; Hoeher, L.; Juchler, N.; Casamitjana, A.; Cleary, J.; Schmick, A.; Baumgartner, P.; Deseoe, J.; Vandans, O.; Lee, D.; Oh, K.; LaBella, D.; Mazher, M.; Niederer, S. A.; Qayyum, A.; Liu, Y.; Chen, J.; Kim, W.; Asawalertsak, N.; Kim, M.; Shin, D.; Park, S.-H.; Kikuchi, S.; Zhang, Y.; Liu, J.; Cui, Y.; Qiu, Y.; Verschuur, A.; Zhang, J.; van der Schaaf, I.; Su, R.;

2026-05-30 radiology and imaging 10.64898/2026.05.28.26354312 medRxiv
Top 0.1%
6.4%

We present the TopBrain 2025 Challenge, the first benchmark for fine-grained multiclass segmentation of the whole brain vasculature in both computed tomography angiography (CTA) and magnetic resonance angiography (MRA). Building on the TopCoW challenge, TopBrain scales vessel annotation from the Circle of Willis to the entire brain, introducing a dataset of 90 annotated volumes across 48 landmark vessel classes spanning arterial and venous systems, of which 50 training volumes are publicly released. Vessel definitions were consolidated from established neuroanatomical references into a unified annotation scheme, and vessel caliber measurements along the centerline are reported for the first time across the whole brain vascular anatomy. To address the unique challenges of multiclass brain vessel segmentation, we propose an evaluation framework that accounts for detection in segmentation performance, assesses anatomical plausibility, and introduces novel contamination metrics that characterize inter-class prediction errors. Fifteen teams from over 220 registered participants submitted algorithms to the benchmark. The top-performing teams built on nnUNet with principled system design choices, achieving around 80% Dice scores, near-zero invalid neighbor counts, over 60% F1 scores for side-road vessels, and below 18% foreground contamination ratio. Larger vessels are easier to segment, while smaller and more complex vessels remain the true bottleneck. The annotated datasets and podium-finish algorithms are made publicly available on Zenodo.

17
Explicit representation of germline and non-germline residues improves antibody language modeling

Kim, J.; Blalock, N.; Kulkarni, A.; Nakamura, K.; Romero, P. A.

2026-05-11 immunology 10.64898/2026.05.06.723387 medRxiv
Top 0.1%
6.3%

Antibodies originate from germline templates and are diversified by somatic hypermutation, producing sequences in which conserved germline residues scaffold structure while rare non-germline (NGL) substitutions refine antigen binding. Current antibody language models (ALMs) treat all residues equivalently and inherit a germline bias that systematically down-weights functionally critical NGL mutations as statistical noise. We introduce PRISM, a germline-aware ALM that explicitly represents germline and non-germline residues as distinct token types over a factorized 53-token vocabulary. PRISM achieves state-of-the-art pseudo-perplexity in hypervariable CDRs and is uniquely positively correlated with experimental binding affinity across three deep mutational scanning landscapes on which all compared ALMs anti-correlate. The dual vocabulary further enables property-specific controllable generation previously unattainable with entangled ALMs. NGL-directed sampling improves physics-based binding scores while GL-directed sampling preserves stability and solubility. These results establish disentangled germline/non-germline representation as a substantive advance in antibody language modeling.

18
Spatiotemporal graph neural networks reveal conformational binding signature in protein dynamics

Motta, S.; Santini, G.; Mansoor, S.; Nezhad, F. H.; Meli, M.; Pandini, A.

2026-05-21 biophysics 10.64898/2026.05.19.726195 medRxiv
Top 0.1%
6.1%

Biomolecular function is often controlled by structural and dynamical adaptations to binding events. Although molecular dynamics (MD) simulations can capture these events at atomic resolution, separating functional signatures from stochastic noise remains challenging. Traditional methods often struggle to isolate mechanistically relevant differences across independent replicas. Here, we introduce an explainable deep learning approach that learns state-specific dynamic signatures directly from MD trajectories. By coupling a dynamic protein graph representation with group-aware contrastive learning across independent replicas, the model detects the signatures, filtering out trajectory-specific correlations. An explainable AI framework then maps the identified differences on individual residues. We demonstrate this approach by identifying "binding-ready" conformations in a T4-Lysozyme mutant, recovering the allosteric determinants of peptide recognition in the PDZ3 domain, and isolating a ligand-independent activation signature for the A2A receptor. Our GISTnet-MD method generalizes across unseen data during comparative MD analysis, translating raw trajectory differences into residue-level determinants of protein function.

19
HHBayes: A Flexible Bayesian Framework for Simulating and Analyzing Household Transmission Dynamics

Li, K.; Hou, Y.; Mukherjee, B.; Pitzer, V. E.; Weinberger, D. M.

2026-04-03 infectious diseases 10.64898/2026.04.01.26349903 medRxiv
Top 0.1%
5.1%

Household transmission studies are important for understanding infectious disease transmission and evaluating interventions; however, they are frequently constrained by methodological challenges, including study design, sample size determination, and estimation of parameters of interest after data collection. Existing tools often lack flexibility in modeling age-specific susceptibility, infectivity patterns, and the impact of interventions such as vaccination or prophylaxis. Here, we develop HHBayes, an open-source R package that provides a unified framework for simulating and analyzing household transmission data using Bayesian methods. The package enables researchers to: (1) simulate realistic household transmission dynamics with highly customizable variables; (2) incorporate viral load data (measured in viral copies/mL or cycle threshold values) to model time-varying infectiousness; (3) estimate age-dependent susceptibility and infectivity parameters using Hamiltonian Monte Carlo methods implemented in Stan; and (4) evaluate intervention effects through user-defined covariates that modify susceptibility or infectivity. We demonstrate the capabilities of the package through simulation studies showing accurate parameter recovery and applications to seasonal respiratory virus transmission, including the impact of vaccination and antiviral prophylaxis on household attack rates. HHBayes addresses a critical gap in infectious disease epidemiology by providing researchers with accessible tools for both prospective study design and retrospective data analysis. The flexibility of the package in handling complex household structures, time-varying infectiousness, and intervention effects makes it valuable for studying diverse pathogens.

20
Quantitative extrapolation from single-tags (QuEST) immunofluorescence microscopy to derive TCR signalosome stoichiometries in human primary T cells

Fei, P.; Dustin, M. L.

2026-03-31 immunology 10.64898/2026.03.28.715001 medRxiv
Top 0.1%
4.9%

Upon T cell receptor (TCR) engagement, a T cell forms an immunological synapse (IS) with an antigen-presenting cell (APC), which can be mimicked by purified ligands on supported lipid bilayers (SLBs) [1,2]. Microvilli actively scan the surface; upon initial engagement, F-actin-dependent TCR microclusters form, and the central supramolecular activation cluster (cSMAC) sustains TCR signaling in CD8 T cells [3,4]. Although signaling activities within the IS have been observed qualitatively through total internal reflection immunofluorescence microscopy [5-7], the stoichiometric relationships among the components of the TCR signalosome remain unknown. In this study, we employed a two-step approach to quantify the components of the TCR signalosome. First, Jurkat cell lines expressing GFP-tagged proteins on a knockout background were used to calibrate fluorescence intensity (IF) signals against molecular copy numbers, based on measurements of single-tag signals and multiple corrections. In the second step, this calibration was applied to determine the stoichiometries of key TCR signalosome components, including TCR, CD8, CD28, CD45, PD-1, Lck, ZAP-70, LAT, and PLCγ1, across scanning, early activation, and sustained activation states in human primary T cells. We refer to the method as quantitative extrapolation from single-tags (QuEST) immunofluorescence microscopy. Applying QuEST, we were surprised to find that the ZAP-70:TCR ratio in microclusters and the cSMAC was 1:1, far from the potential 10:1 ratio. Nanoscale structures of the TCR signalosome were further captured using direct stochastic optical reconstruction microscopy (dSTORM), confirming that ZAP-70 was strongly co-localized with the TCR. Moreover, we applied QuEST to confirm the presence of T cell intrinsic CD28 recruitment, independent of CD80 or CD86 on SLBs, during TCR activation. This T cell intrinsic CD28 recruitment could be disrupted through engagement of PD-1 with PD-L1 on SLBs. This shows that PD-1 engagement can disrupt T cell intrinsic CD28 costimulation. QuEST provides a broadly applicable pipeline for quantitative analysis of TCR signalosomes in human primary cells, enabling a quantitative basis for the rational manipulation and engineering of the TCR signalosome in immunotherapies.